The goal of this project was to apply interpretability methods to the random forest classifier from the PeerJ article “Prioritizing bona fide bacterial small RNAs with machine learning classifiers”, with the purpose of gaining deeper insight into the prioritization of bona fide bacterial sRNAs.
Side note: when we use the word original, we refer either to the RF model created in the base article or to models that were created (trained) in the same way as the base random forest model.
This section is all about getting the libraries and data needed to run this R notebook. We will preprocess the data and load the original model. We start with some OPTIONAL memory cleaning (useful if you want to run this from RStudio with a clean environment).
# This section is just about getting the environment ready and cleaned up. Since this project was originally written as a regular R script and run from within RStudio, it was useful for resetting and cleaning up the environment, so this block of code is optional
clc <- function() cat("\014") ; clc()
remove(list = ls())
gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 2479304 132.5 6565627 350.7 6565627 350.7
Vcells 5037059 38.5 26355833 201.1 65162034 497.2
The following code is based on the h2o documentation site, as it is their recommended way of downloading and installing h2o for R; only the outer if was added to prevent accidental reinstallation. Here is the link for those interested in it.
if (! ("h2o" %in% rownames(installed.packages()))) {
# The following two commands remove any previously installed H2O packages for R.
if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }
# Next, we download packages that H2O depends on.
pkgs <- c("RCurl","jsonlite")
for (pkg in pkgs) {
if (! (pkg %in% rownames(installed.packages()))) { install.packages(pkg) }
}
# Now we download, install and initialize the H2O package for R.
install.packages("h2o", type="source", repos="http://h2o-release.s3.amazonaws.com/h2o/rel-zahradnik/4/R")
}
The rest of the libraries can be installed in the same way as with any other library.
packages <- c("lime", "ROCR", "PRROC", "randomForest", "randomForestExplainer", "png")
for (package in packages) {
if (! (package %in% rownames(installed.packages()))) { install.packages(package) }
}
library("h2o") # ML model building
library("ROCR") # ML evaluation
library("PRROC") # ML evaluation
library("randomForest") # ML model building
library("randomForestExplainer") # ML Global interpretation
library("lime") # ML local interpretation
library("png")
trainDataSet <- read.csv("./DataSets/combinedData.csv", header = TRUE)
trainDataSet[,"Class"] <- as.logical(trainDataSet[,"Class"]) # guarantees that our h2o model trains the random forest (RF) as a classification tree and not a regression tree
dataSetTrain_scaled <- scale(trainDataSet[,-c(5,7,8,9)], center = TRUE, scale = TRUE)
dataSetTrain_scaled <- cbind( dataSetTrain_scaled[,(1:4)],
trainDataSet[,5],
dataSetTrain_scaled[,5],
trainDataSet[,c(7,8,9)])
colnames(dataSetTrain_scaled) <- c("SS", "Pos10wrtsRNAStart", "DistTerm", "Distance",
"sameStrand", "DownDistance", "sameDownStrand", "ID", "Class")
dataSetTrain_scaled[,"Class"] <- as.logical(dataSetTrain_scaled[,"Class"]) # Just making sure Class is treated as a logical value
rbind( head(dataSetTrain_scaled,5), tail(dataSetTrain_scaled,5) )
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
dataSetTrain_norm <- trainDataSet
dataSetTrain_norm$SS <- normalize(dataSetTrain_norm$SS)
dataSetTrain_norm$Pos10wrtsRNAStart <- normalize(dataSetTrain_norm$Pos10wrtsRNAStart)
dataSetTrain_norm$DistTerm <- normalize(dataSetTrain_norm$DistTerm)
dataSetTrain_norm$Distance <- normalize(dataSetTrain_norm$Distance)
dataSetTrain_norm$DownDistance <- normalize(dataSetTrain_norm$DownDistance)
dataSetTrain_norm[,"Class"] <- as.logical(dataSetTrain_norm[,"Class"]) # Just making sure Class is treated as a logical value
rbind( head(dataSetTrain_norm,5), tail(dataSetTrain_norm,5) )
Certain parameters in the LIME functions are recommended based on the distribution of the data. Because of these parameters, we decided to check the data for normality.
dataSetTrainX <- trainDataSet[,-(8:9)]
dataSetTrainY <- trainDataSet[,(8:9)]
for (i in 1:7) {
  feature <- as.numeric(dataSetTrainX[, i])
  hist(x = feature, main = paste("Histogram of", colnames(dataSetTrainX)[i]))
  qqnorm(feature)
  qqline(feature, col = "red", lwd = 2)
}
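The QQ plots give a visual check; as a numeric complement, a Shapiro-Wilk test could be run on each feature. A minimal sketch on synthetic data (the values here are illustrative; on the real data you would pass each column of dataSetTrainX, keeping in mind shapiro.test accepts at most 5000 values):

```r
# Numeric normality check to complement the QQ plots (sketch on
# synthetic data; substitute columns of dataSetTrainX on the real set).
set.seed(1234)
skewed   <- rexp(200)   # clearly non-normal sample
gaussian <- rnorm(200)  # normal by construction
p_skewed <- shapiro.test(skewed)$p.value    # tiny p-value: reject normality
p_gauss  <- shapiro.test(gaussian)$p.value  # typically larger for normal data
c(skewed = p_skewed, gaussian = p_gauss)
```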
Load Original Testing Data Sets
slt2dataPos <- read.csv("./DataSets/SLT2_Positives.tsv", sep = "\t", header = TRUE)
slt2dataPos$Class <- rep(1,nrow(slt2dataPos))
slt2dataNeg <- read.csv("./DataSets/SLT2_Negatives.tsv", sep = "\t", header = TRUE)
slt2dataNeg$Class <- rep(0,nrow(slt2dataNeg))
slt2data <- rbind(slt2dataPos,slt2dataNeg)
ludataPos <- read.csv("./DataSets/Lu_Positives.tsv", sep = "\t", header = TRUE)
ludataPos$Class <- rep(1,nrow(ludataPos))
ludataNeg <- read.csv("./DataSets/Lu_Negatives.tsv", sep = "\t", header = TRUE)
ludataNeg$Class <- rep(0,nrow(ludataNeg))
ludata <- rbind(ludataPos,ludataNeg)
Even though the scale function can scale and center the data for us, we need to scale the testing data sets using the training data’s standard deviations and means to obtain adequate results.
Scaling function:
* Xi = an instance
* Xmean = mean/average of all samples
* Xsd = standard deviation of our data set
* Zi = scaled value of Xi
* Zi = (Xi - Xmean) / Xsd
# Standard Deviation and Mean for Standardization
scale_td <- function( Xi, Xmean, Xsd) {
return ( (Xi - Xmean) / Xsd ) # Zi
}
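As a tiny worked example of the formula (hypothetical numbers, with scale_td redefined so the snippet is self-contained):

```r
# Worked example: a test-set value of 12, with a training mean of 10
# and a training sd of 2, scales to (12 - 10) / 2 = 1.
scale_td <- function(Xi, Xmean, Xsd) {
  (Xi - Xmean) / Xsd
}
scale_td(12, Xmean = 10, Xsd = 2)  # -> 1
```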
trainData_sd_SS <- sd(trainDataSet[,"SS"])
trainData_sd_Pos <- sd(trainDataSet[,"Pos10wrtsRNAStart"])
trainData_sd_DisTerm <- sd(trainDataSet[,"DistTerm"])
trainData_sd_Dis <- sd(trainDataSet[,"Distance"])
trainData_sd_DownDis <- sd(trainDataSet[,"DownDistance"])
trainData_mean_SS <- mean(trainDataSet[,"SS"])
trainData_mean_Pos <- mean(trainDataSet[,"Pos10wrtsRNAStart"])
trainData_mean_DisTerm <- mean(trainDataSet[,"DistTerm"])
trainData_mean_Dis <- mean(trainDataSet[,"Distance"])
trainData_mean_DownDis <- mean(trainDataSet[,"DownDistance"])
slt2data_scaled <- slt2data
slt2data_scaled$SS <- scale_td(slt2data_scaled$SS,
trainData_mean_SS, trainData_sd_SS)
slt2data_scaled$Pos10wrtsRNAStart <- scale_td(slt2data_scaled$Pos10wrtsRNAStart,
trainData_mean_Pos, trainData_sd_Pos)
slt2data_scaled$DistTerm <- scale_td(slt2data_scaled$DistTerm,
trainData_mean_DisTerm, trainData_sd_DisTerm)
slt2data_scaled$Distance <- scale_td(slt2data_scaled$Distance,
trainData_mean_Dis, trainData_sd_Dis)
slt2data_scaled$DownDistance <- scale_td(slt2data_scaled$DownDistance,
trainData_mean_DownDis, trainData_sd_DownDis)
ludata_scaled <- ludata
ludata_scaled$SS <- scale_td(ludata_scaled$SS,
trainData_mean_SS, trainData_sd_SS)
ludata_scaled$Pos10wrtsRNAStart <- scale_td(ludata_scaled$Pos10wrtsRNAStart,
trainData_mean_Pos, trainData_sd_Pos)
ludata_scaled$DistTerm <- scale_td(ludata_scaled$DistTerm,
trainData_mean_DisTerm, trainData_sd_DisTerm)
ludata_scaled$Distance <- scale_td(ludata_scaled$Distance,
trainData_mean_Dis, trainData_sd_Dis)
ludata_scaled$DownDistance <- scale_td(ludata_scaled$DownDistance,
trainData_mean_DownDis, trainData_sd_DownDis)
rbind( head(slt2data_scaled,5), tail(slt2data_scaled,5))
rbind( head(ludata_scaled,5), tail(ludata_scaled,5))
In a similar fashion to what we did in 3.6, we also need to use the training data’s max and min values to properly normalize each feature.
normalize_td <- function(x, min, max) {
return ( (x - min) / (max - min) )
}
trainData_max_SS <- max(trainDataSet[,"SS"])
trainData_max_Pos <- max(trainDataSet[,"Pos10wrtsRNAStart"])
trainData_max_DisTerm <- max(trainDataSet[,"DistTerm"])
trainData_max_Dis <- max(trainDataSet[,"Distance"])
trainData_max_DownDis <- max(trainDataSet[,"DownDistance"])
trainData_min_SS <- min(trainDataSet[,"SS"])
trainData_min_Pos <- min(trainDataSet[,"Pos10wrtsRNAStart"])
trainData_min_DisTerm <- min(trainDataSet[,"DistTerm"])
trainData_min_Dis <- min(trainDataSet[,"Distance"])
trainData_min_DownDis <- min(trainDataSet[,"DownDistance"])
slt2data_norm <- slt2data
slt2data_norm$SS <- normalize_td(slt2data_norm$SS,
trainData_min_SS, trainData_max_SS)
slt2data_norm$Pos10wrtsRNAStart <- normalize_td(slt2data_norm$Pos10wrtsRNAStart,
trainData_min_Pos, trainData_max_Pos)
slt2data_norm$DistTerm <- normalize_td(slt2data_norm$DistTerm,
trainData_min_DisTerm, trainData_max_DisTerm)
slt2data_norm$Distance <- normalize_td(slt2data_norm$Distance,
trainData_min_Dis, trainData_max_Dis)
slt2data_norm$DownDistance <- normalize_td(slt2data_norm$DownDistance,
trainData_min_DownDis, trainData_max_DownDis)
ludata_norm <- ludata
ludata_norm$SS <- normalize_td(ludata_norm$SS,
trainData_min_SS, trainData_max_SS)
ludata_norm$Pos10wrtsRNAStart <- normalize_td(ludata_norm$Pos10wrtsRNAStart,
trainData_min_Pos, trainData_max_Pos)
ludata_norm$DistTerm <- normalize_td(ludata_norm$DistTerm,
trainData_min_DisTerm, trainData_max_DisTerm)
ludata_norm$Distance <- normalize_td(ludata_norm$Distance,
trainData_min_Dis, trainData_max_Dis)
ludata_norm$DownDistance <- normalize_td(ludata_norm$DownDistance,
trainData_min_DownDis, trainData_max_DownDis)
rbind( head(slt2data_norm,5), tail(slt2data_norm,5))
rbind( head(ludata_norm,5), tail(ludata_norm,5))
The original article, titled “Prioritizing bona fide bacterial small RNAs with machine learning classifiers”, and the source materials (training data, testing data, R scripts, .rds file, etc.) can all be found in PeerJ under the following link. The original .rds file can also be downloaded from GitHub.
We can either retrain the original model, or load the .rds file provided in the original article. For the purpose of saving time, we simply loaded the model.
origRF <- readRDS("RF_classifier4sRNA.rds")
Since some of the interpretability methods require scaled or normalized data to function properly, we needed to create two new models in the same way as the one loaded in 4.1.
tuneRF(dataSetTrainX[,c(1:7)], y = factor(dataSetTrainY[,2]), ntreeTry = 400, mtryStart = 2)
mtry = 2 OOB error = 10.74%
Searching left ...
mtry = 1 OOB error = 11.5%
-0.07142857 0.05
Searching right ...
mtry = 4 OOB error = 10.43%
0.02857143 0.05
mtry OOBError
1.OOB 1 0.1150307
2.OOB 2 0.1073620
4.OOB 4 0.1042945
set.seed(1234)
origRF_scaled <- randomForest(x = dataSetTrain_scaled[,c(1:7)], y = factor(dataSetTrainY[,2]), mtry = 2, ntree = 400, importance = TRUE)
origRF_norm <- randomForest(x = dataSetTrain_norm[,c(1:7)], y = factor(dataSetTrainY[,2]), mtry = 2, ntree = 400, importance = TRUE)
Since we are going to use the h2o library to perform some of our machine learning interpretability, we need to train an equivalent one, using the h2o tools.
h2o.init( # Initialize h2o
nthreads = -1, # -1 = use all available threads
max_mem_size = "8G" # specify the amount of memory to use
)
Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 3 days 10 hours
H2O cluster timezone: America/St_Johns
H2O data parsing timezone: UTC
H2O cluster version: 3.30.0.4
H2O cluster version age: 1 month and 4 days
H2O cluster name: H2O_started_from_R_carlos.salcedo_lix921
H2O cluster total nodes: 1
H2O cluster total memory: 6.82 GB
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
H2O API Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
R Version: R version 3.6.1 (2019-07-05)
h2o.removeAll() # clean up the system, h2o-wise (ie. kill any running h2o clusters)
h2o.no_progress()
trainData <- as.h2o(trainDataSet)
trainData_scaled <- as.h2o(dataSetTrain_scaled)
trainData_norm <- as.h2o(dataSetTrain_norm)
rfh2o <- h2o.randomForest(
training_frame = trainData, # training using the normal data
x = 1:7, # features to use to generate the prediction
y = 9, # Class type -> what we want to predict
model_id = "rf1_sRNA", # name of model in h2o
ntrees = 400, # max number of trees
seed = 1234, # seed, has to be set WITHIN the h2o function and it's supposed to be different from "R's seed", so results might not be exactly the same as orig model, but should be similar enough
mtries = 2, # Same as original model
max_depth = 30
)
rfh2o_scaled <- h2o.randomForest(
training_frame = trainData_scaled,
x = 1:7,
y = 9,
model_id = "rf2_sRNA",
ntrees = 400,
seed = 1234,
mtries = 2,
max_depth = 30
)
rfh2o_norm <- h2o.randomForest(
training_frame = trainData_norm,
x = 1:7,
y = 9,
model_id = "rf3_sRNA",
ntrees = 400,
seed = 1234,
mtries = 2,
max_depth = 30
)
As we were doing tests with LIME, we discovered that it yielded irregular responses, and as a result, its explanations could not be trusted at the time. It is a known issue that LIME is not a good fit for every model, and that it struggles with highly diverse data. In our use case, the data contains mixed numerical and categorical features, numerical features with significantly different scales and magnitudes, and an imbalanced data set that favors predicting sRNAs as false. Since LIME creates a local interpretable linear model for its explanations, the GLM constructed here was simply to verify the hypothesis that a linear model would be unable to explain the data. The imbalance seems to create models in which the features always tend to support a result of “false”, while claiming that all the features are against a result of “true”, as will be evidenced in section 15.1.
glmh2o <- h2o.glm( # https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/glm.html
training_frame = trainData, # training using the normal data
x = 1:7, # features to use to generate the prediction
y = 9, # Class type -> what we want to predict
model_id = "glm_sRNA", # name of model in h2o
seed = 1234,
family = "binomial"
)
glmh2o_perf <- h2o.performance(glmh2o)
glmh2o_perf
H2OBinomialMetrics: glm
** Reported on training data. **
MSE: 0.1476104
RMSE: 0.3842009
LogLoss: 0.4680047
Mean Per-Class Error: 0.2607362
AUC: 0.7639542
AUCPR: 0.5749941
Gini: 0.5279085
R^2: 0.2127447
Residual Deviance: 610.2781
AIC: 626.2781
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
Maximum Metrics: Maximum metrics at their respective thresholds
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
h2o.coef(glmh2o)
Intercept SS Pos10wrtsRNAStart DistTerm Distance sameStrand DownDistance sameDownStrand
-5.588791e-02 -1.618068e-02 4.486513e-05 -2.304662e-03 -5.368581e-05 -4.228112e-01 1.958833e-04 2.178087e-01
h2o.coef_norm(glmh2o)
Intercept SS Pos10wrtsRNAStart DistTerm Distance sameStrand DownDistance sameDownStrand
-1.31299148 -0.37876153 0.01831303 -0.93619002 -0.02021210 -0.21155895 0.08702437 0.10892590
glmh2o@model$coefficients_table
Coefficients: glm coefficients
h2o.r2(glmh2o) # The R^2 value is really low, and the maximum recall is also really low, meaning that a linear model (and by extension LIME) tends to predict that everything should be false
[1] 0.2127447
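The imbalance effect described above is easy to reproduce on synthetic data. A minimal sketch (all numbers hypothetical, not from the sRNA data sets) of a logistic GLM drifting toward the majority class:

```r
# With a ~9:1 imbalanced outcome and an uninformative feature, a plain
# logistic GLM predicts the majority (0 / "false") class almost everywhere.
set.seed(1234)
n <- 500
x <- rnorm(n)                    # hypothetical, uninformative feature
y <- c(rep(0, 450), rep(1, 50))  # ~9:1 class imbalance
fit   <- glm(y ~ x, family = binomial)
preds <- ifelse(predict(fit, type = "response") >= 0.5, 1, 0)
table(factor(preds, levels = c(0, 1)))  # nearly everything predicted 0
```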
This is the performance with the OOB error, based on the training process (i.e., the training data).
rfh2o
Model Details:
==============
H2OBinomialModel: drf
Model ID: rf1_sRNA
Model Summary:
H2OBinomialMetrics: drf
** Reported on training data. **
** Metrics reported on Out-Of-Bag training samples **
MSE: 0.08046297
RMSE: 0.28366
LogLoss: 0.2701744
Mean Per-Class Error: 0.1247444
AUC: 0.9370883
AUCPR: 0.8592391
Gini: 0.8741767
R^2: 0.5708641
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
Maximum Metrics: Maximum metrics at their respective thresholds
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
rfh2o@model$variable_importances
Variable Importances:
h2o.varimp_plot(rfh2o)
rfh2o_training_performance <- h2o.performance(rfh2o)
rfh2o_training_performance
H2OBinomialMetrics: drf
** Reported on training data. **
** Metrics reported on Out-Of-Bag training samples **
MSE: 0.08046297
RMSE: 0.28366
LogLoss: 0.2701744
Mean Per-Class Error: 0.1247444
AUC: 0.9370883
AUCPR: 0.8592391
Gini: 0.8741767
R^2: 0.5708641
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
Maximum Metrics: Maximum metrics at their respective thresholds
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
In this section we’ll get the predictions that all the models generate on the LU and SLT2 testing data. The goal of this section is not to prove which model is best, but to show that the models are similar enough that the analysis of the H2O model will be a valid analog for the original RF model.
origRF_lu_pred <- predict(origRF, ludata[,-8], type = "prob")
origRF_slt2_pred <- predict(origRF, slt2data[,-8], type = "prob")
origRF_scaled_lu_pred <- predict(origRF_scaled, ludata_scaled[,-8], type = "prob")
origRF_scaled_slt2_pred <- predict(origRF_scaled, slt2data_scaled[,-8], type = "prob")
origRF_norm_lu_pred <- predict(origRF_norm, ludata_norm[,-8], type = "prob")
origRF_norm_slt2_pred <- predict(origRF_norm, slt2data_norm[,-8], type = "prob")
rfh2o_slt2_pred <- h2o.predict(object = rfh2o, newdata = as.h2o(slt2data[,-8]) )
rfh2o_lu_pred <- h2o.predict(object = rfh2o, newdata = as.h2o(ludata[,-8]) )
rfh2o_scaled_slt2_pred <- h2o.predict(object = rfh2o_scaled, newdata = as.h2o(slt2data_scaled[,-8]) )
rfh2o_scaled_lu_pred <- h2o.predict(object = rfh2o_scaled, newdata = as.h2o(ludata_scaled[,-8]) )
rfh2o_norm_slt2_pred <- h2o.predict(object = rfh2o_norm, newdata = as.h2o(slt2data_norm[,-8]) )
rfh2o_norm_lu_pred <- h2o.predict(object = rfh2o_norm, newdata = as.h2o(ludata_norm[,-8]) )
glm_slt2_pred <- h2o.predict(object = glmh2o, newdata = as.h2o(slt2data[,-8]) )
glm_lu_pred <- h2o.predict(object = glmh2o, newdata = as.h2o(ludata[,-8]) )
This function simplifies the comparison between the predictions from the original RF model and the H2O RF model. Later on, it will also be used to compare the predictions between the OrigRF model and the other models.
# This function will be used later on as well
# a = the original model's predictions
# b = the alternate model's predictions
# c = the correct prediction
compareAnswers <- function(a,b,c){
if ( a == c & b == c )
{ res="BOTH_RIGHT" }
else if ( a == c & b != c )
{ res="OnlyA" }
else if ( a != c & b == c )
{ res="OnlyB" }
else
{ res="BOTH_WRONG" }
return(res)
}
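For reference, the row-by-row loops that follow could equivalently be written with mapply. A minimal self-contained sketch (compareAnswers redefined here, with hypothetical predictions):

```r
# compareAnswers applied elementwise via mapply, equivalent to looping
# over rows one at a time.
compareAnswers <- function(a, b, c) {
  if (a == c & b == c)      { res = "BOTH_RIGHT" }
  else if (a == c & b != c) { res = "OnlyA" }
  else if (a != c & b == c) { res = "OnlyB" }
  else                      { res = "BOTH_WRONG" }
  return(res)
}
a     <- c(1, 0, 1, 0)  # original model's class predictions (hypothetical)
b     <- c(1, 1, 0, 0)  # alternate model's class predictions (hypothetical)
truth <- c(1, 1, 1, 1)  # correct classes
mapply(compareAnswers, a, b, truth)
# -> "BOTH_RIGHT" "OnlyB" "OnlyA" "BOTH_WRONG"
```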
slt2_predictions <- cbind(as.data.frame(slt2data), # Input
as.data.frame(origRF_slt2_pred[,2]), # Orig RF predictions
as.data.frame(origRF_scaled_slt2_pred[,2]), # Orig RF Scaled predictions
as.data.frame(origRF_norm_slt2_pred[,2]), # Orig RF Normalized predictions
as.data.frame(rfh2o_slt2_pred[,3]), # H2O RF predictions
as.data.frame(rfh2o_scaled_slt2_pred[,3]), # Scaled RF predictions
as.data.frame(rfh2o_norm_slt2_pred[,3]) # Normalized RF predictions
)
colnames(slt2_predictions) <- c("SS", "Pos10wrtsRNAStart", "DistTerm", "Distance",
"sameStrand", "DownDistance", "sameDownStrand", "Class",
"OrigRF_P","OrigRF_scaled_P","OrigRF_norm_P",
"RF_H2O_P","RFH2O_scaled_P","RFH2O_norm_P")
slt2_predictions$origPreds <- ifelse( slt2_predictions$OrigRF_P >= 0.5, 1, 0)
slt2_predictions$origScaledPreds <- ifelse( slt2_predictions$OrigRF_scaled_P >= 0.5, 1, 0)
slt2_predictions$origNormPreds <- ifelse( slt2_predictions$OrigRF_norm_P >= 0.5, 1, 0)
slt2_predictions$h2oPreds <- ifelse( slt2_predictions$RF_H2O_P >= 0.5, 1, 0)
slt2_predictions$h2oScaledPreds <- ifelse( slt2_predictions$RFH2O_scaled_P >= 0.5, 1, 0)
slt2_predictions$h2oNormPreds <- ifelse( slt2_predictions$RFH2O_norm_P >= 0.5, 1, 0)
slt2_predictions$origVsH2O <- NA
slt2_predictions$origVsOrigScaled <- NA
slt2_predictions$origVsOrigNorm <- NA
slt2_predictions$origVsH2OScaled <- NA
slt2_predictions$origVsH2ONorm <- NA
for( i in 1:nrow(slt2_predictions) ){
# Based on the way we are feeding the values to the compareAnswers function,
# Similarities = OnlyA means that only the Orig RF got the right answer
# Similarities = OnlyB means that only the other RF got the right answer
# All other answers will tell us where both models were right("BOTH_RIGHT"), or
# wrong ("BOTH_WRONG")
slt2_predictions[i,]$origVsOrigScaled <- compareAnswers(
slt2_predictions[i,]$origPreds, slt2_predictions[i,]$origScaledPreds,slt2_predictions[i,]$Class )
slt2_predictions[i,]$origVsOrigNorm <- compareAnswers(
slt2_predictions[i,]$origPreds, slt2_predictions[i,]$origNormPreds, slt2_predictions[i,]$Class )
slt2_predictions[i,]$origVsH2O <- compareAnswers(
slt2_predictions[i,]$origPreds, slt2_predictions[i,]$h2oPreds, slt2_predictions[i,]$Class )
slt2_predictions[i,]$origVsH2OScaled <- compareAnswers(
slt2_predictions[i,]$origPreds, slt2_predictions[i,]$h2oScaledPreds, slt2_predictions[i,]$Class )
slt2_predictions[i,]$origVsH2ONorm <- compareAnswers(
slt2_predictions[i,]$origPreds, slt2_predictions[i,]$h2oNormPreds, slt2_predictions[i,]$Class )
}
rbind( head(slt2_predictions, 5), tail(slt2_predictions, 5) )
str(slt2_predictions)
'data.frame': 1986 obs. of 25 variables:
$ SS : num -48.3 -8.8 -37.4 -26.3 -41.7 ...
$ Pos10wrtsRNAStart: int -56 -14 119 -15 -23 -48 -36 77 -15 -14 ...
$ DistTerm : int 0 1000 0 0 1000 1000 112 0 0 0 ...
$ Distance : int 0 -24 -98 -47 -2 -23 -53 -22 -71 -28 ...
$ sameStrand : int 1 0 0 1 0 0 0 0 1 0 ...
$ DownDistance : int 14 0 20 17 1 47 245 0 0 175 ...
$ sameDownStrand : int 1 0 0 1 1 0 0 0 1 0 ...
$ Class : num 1 1 1 1 1 1 1 1 1 1 ...
$ OrigRF_P : num 0.73 0.383 0.775 0.985 0.57 ...
$ OrigRF_scaled_P : num 0.745 0.362 0.72 0.98 0.58 ...
$ OrigRF_norm_P : num 0.743 0.328 0.748 0.98 0.61 ...
$ RF_H2O_P : num 0.74 0.372 0.705 0.973 0.539 ...
$ RFH2O_scaled_P : num 0.746 0.364 0.71 0.98 0.437 ...
$ RFH2O_norm_P : num 0.75 0.375 0.708 0.979 0.438 ...
$ origPreds : num 1 0 1 1 1 1 1 1 1 1 ...
$ origScaledPreds : num 1 0 1 1 1 1 1 1 1 1 ...
$ origNormPreds : num 1 0 1 1 1 1 1 1 1 1 ...
$ h2oPreds : num 1 0 1 1 1 1 1 1 1 1 ...
$ h2oScaledPreds : num 1 0 1 1 0 1 1 1 1 1 ...
$ h2oNormPreds : num 1 0 1 1 0 1 1 1 1 1 ...
$ origVsH2O : chr "BOTH_RIGHT" "BOTH_WRONG" "BOTH_RIGHT" "BOTH_RIGHT" ...
$ origVsOrigScaled : chr "BOTH_RIGHT" "BOTH_WRONG" "BOTH_RIGHT" "BOTH_RIGHT" ...
$ origVsOrigNorm : chr "BOTH_RIGHT" "BOTH_WRONG" "BOTH_RIGHT" "BOTH_RIGHT" ...
$ origVsH2OScaled : chr "BOTH_RIGHT" "BOTH_WRONG" "BOTH_RIGHT" "BOTH_RIGHT" ...
$ origVsH2ONorm : chr "BOTH_RIGHT" "BOTH_WRONG" "BOTH_RIGHT" "BOTH_RIGHT" ...
As an additional check, we also reviewed the correlations between the OrigRF predictions and the alternative models’ predictions on the SLT2 data.
cor(slt2_predictions[,"OrigRF_P"],slt2_predictions[,"RF_H2O_P"])
[1] 0.9842173
cor(slt2_predictions[,"OrigRF_P"],slt2_predictions[,"OrigRF_scaled_P"])
[1] 0.9979076
cor(slt2_predictions[,"OrigRF_P"],slt2_predictions[,"OrigRF_norm_P"])
[1] 0.9976877
cor(slt2_predictions[,"OrigRF_P"],slt2_predictions[,"RFH2O_scaled_P"])
[1] 0.9846473
cor(slt2_predictions[,"OrigRF_P"],slt2_predictions[,"RFH2O_norm_P"])
[1] 0.9847233
Here, we are simply taking a look at how many instances the OrigRF model got correct vs. the other models, and how many instances both models predicted correctly or incorrectly.
* A value of OnlyA means that only the orig RF got the right answer
* A value of OnlyB means that only the other RF got the right answer
* All other values tell us where both models were right (“BOTH_RIGHT”) or wrong (“BOTH_WRONG”)
nrow( slt2_predictions[slt2_predictions$origVsOrigScaled == "OnlyA",] )
[1] 8
nrow( slt2_predictions[slt2_predictions$origVsOrigScaled == "OnlyB",] )
[1] 8
nrow( slt2_predictions[slt2_predictions$origVsOrigScaled == "BOTH_WRONG",] )
[1] 161
nrow( slt2_predictions[slt2_predictions$origVsOrigScaled == "BOTH_RIGHT",] )
[1] 1809
nrow( slt2_predictions[slt2_predictions$origVsOrigNorm == "OnlyA",] )
[1] 8
nrow( slt2_predictions[slt2_predictions$origVsOrigNorm == "OnlyB",] )
[1] 12
nrow( slt2_predictions[slt2_predictions$origVsOrigNorm == "BOTH_WRONG",] )
[1] 157
nrow( slt2_predictions[slt2_predictions$origVsOrigNorm == "BOTH_RIGHT",] )
[1] 1809
nrow( slt2_predictions[slt2_predictions$origVsH2O == "OnlyA",] )
[1] 26
nrow( slt2_predictions[slt2_predictions$origVsH2O == "OnlyB",] )
[1] 19
nrow( slt2_predictions[slt2_predictions$origVsH2O == "BOTH_WRONG",] )
[1] 150
nrow( slt2_predictions[slt2_predictions$origVsH2O == "BOTH_RIGHT",] )
[1] 1791
nrow( slt2_predictions[slt2_predictions$origVsH2OScaled == "OnlyA",] )
[1] 29
nrow( slt2_predictions[slt2_predictions$origVsH2OScaled == "OnlyB",] )
[1] 19
nrow( slt2_predictions[slt2_predictions$origVsH2OScaled == "BOTH_WRONG",] )
[1] 150
nrow( slt2_predictions[slt2_predictions$origVsH2OScaled == "BOTH_RIGHT",] )
[1] 1788
nrow( slt2_predictions[slt2_predictions$origVsH2ONorm == "OnlyA",] )
[1] 32
nrow( slt2_predictions[slt2_predictions$origVsH2ONorm == "OnlyB",] )
[1] 19
nrow( slt2_predictions[slt2_predictions$origVsH2ONorm == "BOTH_WRONG",] )
[1] 150
nrow( slt2_predictions[slt2_predictions$origVsH2ONorm == "BOTH_RIGHT",] )
[1] 1785
lu_predictions <- cbind(as.data.frame(ludata), # Input
as.data.frame(origRF_lu_pred[,2]), # Orig RF predictions
as.data.frame(origRF_scaled_lu_pred[,2]), # Orig RF Scaled predictions
as.data.frame(origRF_norm_lu_pred[,2]), # Orig RF Normalized predictions
as.data.frame(rfh2o_lu_pred[,3]), # H2O RF predictions
as.data.frame(rfh2o_scaled_lu_pred[,3]), # Scaled RF predictions
as.data.frame(rfh2o_norm_lu_pred[,3]) # Normalized RF predictions
)
colnames(lu_predictions) <- c("SS", "Pos10wrtsRNAStart", "DistTerm", "Distance",
"sameStrand", "DownDistance", "sameDownStrand", "Class",
"OrigRF_P","OrigRF_scaled_P","OrigRF_norm_P",
"RF_H2O_P","RFH2O_scaled_P","RFH2O_norm_P")
lu_predictions$origPreds <- ifelse( lu_predictions$OrigRF_P >= 0.5, 1, 0)
lu_predictions$origScaledPreds <- ifelse( lu_predictions$OrigRF_scaled_P >= 0.5, 1, 0)
lu_predictions$origNormPreds <- ifelse( lu_predictions$OrigRF_norm_P >= 0.5, 1, 0)
lu_predictions$h2oPreds <- ifelse( lu_predictions$RF_H2O_P >= 0.5, 1, 0)
lu_predictions$h2oScaledPreds <- ifelse( lu_predictions$RFH2O_scaled_P >= 0.5, 1, 0)
lu_predictions$h2oNormPreds <- ifelse( lu_predictions$RFH2O_norm_P >= 0.5, 1, 0)
lu_predictions$origVsH2O <- NA
lu_predictions$origVsOrigScaled <- NA
lu_predictions$origVsOrigNorm <- NA
lu_predictions$origVsH2OScaled <- NA
lu_predictions$origVsH2ONorm <- NA
for( i in 1:nrow(lu_predictions) ){
# Based on the way we are feeding the values to the compareAnswers function,
# Similarities = OnlyA means that only the Orig RF got the right answer
# Similarities = OnlyB means that only the other RF got the right answer
# All other answers will tell us where both models were right("BOTH_RIGHT"), or
# wrong ("BOTH_WRONG")
lu_predictions[i,]$origVsOrigScaled <- compareAnswers(
lu_predictions[i,]$origPreds, lu_predictions[i,]$origScaledPreds,lu_predictions[i,]$Class )
lu_predictions[i,]$origVsOrigNorm <- compareAnswers(
lu_predictions[i,]$origPreds, lu_predictions[i,]$origNormPreds, lu_predictions[i,]$Class )
lu_predictions[i,]$origVsH2O <- compareAnswers(
lu_predictions[i,]$origPreds, lu_predictions[i,]$h2oPreds, lu_predictions[i,]$Class )
lu_predictions[i,]$origVsH2OScaled <- compareAnswers(
lu_predictions[i,]$origPreds, lu_predictions[i,]$h2oScaledPreds, lu_predictions[i,]$Class )
lu_predictions[i,]$origVsH2ONorm <- compareAnswers(
lu_predictions[i,]$origPreds, lu_predictions[i,]$h2oNormPreds, lu_predictions[i,]$Class )
}
str(lu_predictions)
'data.frame': 3009 obs. of 25 variables:
$ SS : num -174.1 -30.9 -117.6 -30 -33.1 ...
$ Pos10wrtsRNAStart: int -18 -1000 -24 -26 56 -94 -35 12 77 244 ...
$ DistTerm : int 14 445 36 1000 1000 9 6 494 19 1000 ...
$ Distance : int -14 -1217 -292 -121 -112 -32 -83 -65 0 -121 ...
$ sameStrand : int 1 0 0 0 0 1 1 1 1 1 ...
$ DownDistance : int 68 379 18 228 61 74 71 41 0 131 ...
$ sameDownStrand : int 0 1 1 1 1 1 1 1 1 0 ...
$ Class : num 1 1 1 1 1 1 1 1 1 1 ...
$ OrigRF_P : num 0.968 0.295 0.892 0.855 0.695 ...
$ OrigRF_scaled_P : num 0.948 0.305 0.897 0.895 0.72 ...
$ OrigRF_norm_P : num 0.958 0.285 0.91 0.83 0.738 ...
$ RF_H2O_P : num 0.888 0.286 0.879 0.869 0.648 ...
$ RFH2O_scaled_P : num 0.882 0.301 0.877 0.849 0.613 ...
$ RFH2O_norm_P : num 0.885 0.302 0.87 0.851 0.622 ...
$ origPreds : num 1 0 1 1 1 1 1 1 0 0 ...
$ origScaledPreds : num 1 0 1 1 1 1 1 1 0 0 ...
$ origNormPreds : num 1 0 1 1 1 1 1 1 0 0 ...
$ h2oPreds : num 1 0 1 1 1 1 1 1 0 0 ...
$ h2oScaledPreds : num 1 0 1 1 1 1 1 1 0 0 ...
$ h2oNormPreds : num 1 0 1 1 1 1 1 1 0 0 ...
$ origVsH2O : chr "BOTH_RIGHT" "BOTH_WRONG" "BOTH_RIGHT" "BOTH_RIGHT" ...
$ origVsOrigScaled : chr "BOTH_RIGHT" "BOTH_WRONG" "BOTH_RIGHT" "BOTH_RIGHT" ...
$ origVsOrigNorm : chr "BOTH_RIGHT" "BOTH_WRONG" "BOTH_RIGHT" "BOTH_RIGHT" ...
$ origVsH2OScaled : chr "BOTH_RIGHT" "BOTH_WRONG" "BOTH_RIGHT" "BOTH_RIGHT" ...
$ origVsH2ONorm : chr "BOTH_RIGHT" "BOTH_WRONG" "BOTH_RIGHT" "BOTH_RIGHT" ...
rbind( head(lu_predictions, 5), tail(lu_predictions, 5))
As an additional check, we also reviewed the correlations between the OrigRF predictions and the alternative models’ predictions on the LU data.
cor(lu_predictions[,"OrigRF_P"],lu_predictions[,"RF_H2O_P"])
[1] 0.9870177
cor(lu_predictions[,"OrigRF_P"],lu_predictions[,"OrigRF_scaled_P"])
[1] 0.9977671
cor(lu_predictions[,"OrigRF_P"],lu_predictions[,"OrigRF_norm_P"])
[1] 0.9980152
cor(lu_predictions[,"OrigRF_P"],lu_predictions[,"RFH2O_scaled_P"])
[1] 0.9863546
cor(lu_predictions[,"OrigRF_P"],lu_predictions[,"RFH2O_norm_P"])
[1] 0.9866413
Basically, the same explanation as in section 10.2:
* A value of OnlyA means that only the orig RF got the right answer
* A value of OnlyB means that only the other RF got the right answer
* All other values tell us where both models were right (“BOTH_RIGHT”) or wrong (“BOTH_WRONG”)
nrow( lu_predictions[lu_predictions$origVsOrigScaled == "OnlyA",] )
[1] 9
nrow( lu_predictions[lu_predictions$origVsOrigScaled == "OnlyB",] )
[1] 19
nrow( lu_predictions[lu_predictions$origVsOrigScaled == "BOTH_WRONG",] )
[1] 398
nrow( lu_predictions[lu_predictions$origVsOrigScaled == "BOTH_RIGHT",] )
[1] 2583
nrow( lu_predictions[lu_predictions$origVsOrigNorm == "OnlyA",] )
[1] 13
nrow( lu_predictions[lu_predictions$origVsOrigNorm == "OnlyB",] )
[1] 20
nrow( lu_predictions[lu_predictions$origVsOrigNorm == "BOTH_WRONG",] )
[1] 397
nrow( lu_predictions[lu_predictions$origVsOrigNorm == "BOTH_RIGHT",] )
[1] 2579
nrow( lu_predictions[lu_predictions$origVsH2O == "OnlyA",] )
[1] 35
nrow( lu_predictions[lu_predictions$origVsH2O == "OnlyB",] )
[1] 33
nrow( lu_predictions[lu_predictions$origVsH2O == "BOTH_WRONG",] )
[1] 384
nrow( lu_predictions[lu_predictions$origVsH2O == "BOTH_RIGHT",] )
[1] 2557
nrow( lu_predictions[lu_predictions$origVsH2OScaled == "OnlyA",] )
[1] 31
nrow( lu_predictions[lu_predictions$origVsH2OScaled == "OnlyB",] )
[1] 35
nrow( lu_predictions[lu_predictions$origVsH2OScaled == "BOTH_WRONG",] )
[1] 382
nrow( lu_predictions[lu_predictions$origVsH2OScaled == "BOTH_RIGHT",] )
[1] 2561
nrow( lu_predictions[lu_predictions$origVsH2ONorm == "OnlyA",] )
[1] 29
nrow( lu_predictions[lu_predictions$origVsH2ONorm == "OnlyB",] )
[1] 35
nrow( lu_predictions[lu_predictions$origVsH2ONorm == "BOTH_WRONG",] )
[1] 382
nrow( lu_predictions[lu_predictions$origVsH2ONorm == "BOTH_RIGHT",] )
[1] 2563
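The block of nrow() counts above can be obtained in one call per comparison column with table(). This is a sketch assuming the origVs* agreement columns already exist in lu_predictions:

```r
# Sketch: tabulate the agreement categories of each comparison column.
# Assumes lu_predictions holds the origVs* columns queried above.
agreement_cols <- c("origVsOrigScaled", "origVsOrigNorm", "origVsH2O",
                    "origVsH2OScaled", "origVsH2ONorm")
lapply(lu_predictions[, agreement_cols], table)
```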
While sections 10.2 and 11.2 suggest that all the models behave similarly, judging solely on accuracy can be deceiving. For this reason, we also decided to dive deeper into the models’ performances and run additional side-by-side comparisons of other performance metrics.
This evaluateData function was copied directly from the source materials (the supplemental material included in the original article), as it was the one used to evaluate the original model; we have simply added a few extra comments. The H2O library already contains tools to obtain performance metrics from its models, so no additional functions were necessary for the H2O ML models.
evaluateData <- function(RF, data, labels){
require(ROCR)
require(PRROC)
require(randomForest)
res <- list()
res$predD <- predict(RF, data, type = "prob") # Predictions as Generated by the model
res$pred <- prediction(res$predD[,2], as.logical(labels)) # Predictions, but converted to S4 Object
res$PR <- performance(res$pred, measure = "prec", x.measure = "rec") # Precision Recall Curve
res$SS <- performance(res$pred, measure="sens", x.measure="spec") # Sensitivity vs. Specificity
res$auc <- performance(res$pred, measure = "auc") # Area under the ROC curve
res$acc <- performance(res$pred, measure = "acc") # Accuracy
res$pr <- pr.curve(scores.class0 = res$predD[labels == 1,2], # Precision vs. Recall
scores.class1 = res$predD[labels == 0,2],
curve = T)
return(res)
}
# OrigRF Performance Metrics
origRF_slt2_performance <- evaluateData(origRF, slt2data[,-8], slt2data[,8])
origRF_scaled_slt2_performance <- evaluateData(origRF_scaled, slt2data_scaled[,-8], slt2data_scaled[,8])
origRF_norm_slt2_performance <- evaluateData(origRF_norm, slt2data_norm[,-8], slt2data_norm[,8])
origRF_lu_performance <- evaluateData(origRF, ludata[,-8], ludata[,8])
origRF_scaled_lu_performance <- evaluateData(origRF_scaled, ludata_scaled[,-8], ludata_scaled[,8])
origRF_norm_lu_performance <- evaluateData(origRF_norm, ludata_norm[,-8], ludata_norm[,8])
# H2O RF Performance Metrics
# H2O provides a way to retrieve most of the desired metrics
# http://docs.h2o.ai/h2o/latest-stable/h2o-r/docs/reference/h2o.metric.html
slt2data_h2o <- slt2data
slt2data_h2o[,"Class"] <- as.logical(slt2data_h2o[,"Class"])
rfh2o_slt2_performance <- h2o.performance(rfh2o, newdata = as.h2o(slt2data_h2o))
ludata_h2o <- ludata
ludata_h2o[,"Class"] <- as.logical(ludata_h2o[,"Class"])
rfh2o_lu_performance <- h2o.performance(rfh2o, newdata = as.h2o(ludata_h2o))
slt2data_h2o_scaled <- slt2data_scaled
slt2data_h2o_scaled[,"Class"] <- as.logical(slt2data_h2o_scaled[,"Class"])
rfh2o_slt2_performance_scaled <- h2o.performance(rfh2o, newdata = as.h2o(slt2data_h2o_scaled))
ludata_h2o_scaled <- ludata_scaled
ludata_h2o_scaled[,"Class"] <- as.logical(ludata_h2o_scaled[,"Class"])
rfh2o_lu_performance_scaled <- h2o.performance(rfh2o, newdata = as.h2o(ludata_h2o_scaled))
slt2data_h2o_norm <- slt2data_norm
slt2data_h2o_norm[,"Class"] <- as.logical(slt2data_h2o_norm[,"Class"])
rfh2o_slt2_performance_norm <- h2o.performance(rfh2o, newdata = as.h2o(slt2data_h2o_norm))
ludata_h2o_norm <- ludata_norm
ludata_h2o_norm[,"Class"] <- as.logical(ludata_h2o_norm[,"Class"])
rfh2o_lu_performance_norm <- h2o.performance(rfh2o, newdata = as.h2o(ludata_h2o_norm))
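The six evaluations above all follow the same copy / cast-Class-to-logical / evaluate pattern, so they could be factored into a small helper. This is a sketch; h2o.performance and as.h2o are the same H2O calls used above, and the hypothetical h2o_perf_on name is ours:

```r
# Sketch: wrap the repeated "copy data, cast Class to logical, evaluate" steps.
h2o_perf_on <- function(model, data, class_col = "Class") {
  d <- data
  d[, class_col] <- as.logical(d[, class_col])
  h2o.performance(model, newdata = as.h2o(d))
}

# e.g. rfh2o_lu_performance_norm <- h2o_perf_on(rfh2o, ludata_norm)
```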
metrics_table <- data.frame("Accuracy" = 1:12)
rownames(metrics_table) <- c("origRF_slt2_perf", "origRF_scaled_slt2_perf", "origRF_norm_slt2_perf",
"rfh2o_slt2_perf", "rfh2o_slt2_perf_scaled", "rfh2o_slt2_perf_norm",
"origRF_lu_perf", "origRF_scaled_lu_perf", "origRF_norm_lu_perf",
"rfh2o_lu_perf", "rfh2o_lu_perf_scaled", "rfh2o_lu_perf_norm")
metrics_table["origRF_slt2_perf","Accuracy"] <- nrow(slt2_predictions[slt2_predictions$origVsOrigScaled == "BOTH_RIGHT" | slt2_predictions$origVsOrigScaled == "OnlyA",] )/
nrow(slt2_predictions)
metrics_table["origRF_scaled_slt2_perf","Accuracy"] <- nrow(slt2_predictions[slt2_predictions$origVsOrigScaled == "BOTH_RIGHT" | slt2_predictions$origVsOrigScaled == "OnlyB",] )/
nrow(slt2_predictions)
metrics_table["origRF_norm_slt2_perf","Accuracy"] <- nrow(slt2_predictions[slt2_predictions$origVsOrigNorm == "BOTH_RIGHT" | slt2_predictions$origVsOrigNorm == "OnlyB",] )/
nrow(slt2_predictions)
metrics_table["rfh2o_slt2_perf","Accuracy"] <- nrow(slt2_predictions[slt2_predictions$origVsH2O == "BOTH_RIGHT" | slt2_predictions$origVsH2O == "OnlyB",] )/
nrow(slt2_predictions)
metrics_table["rfh2o_slt2_perf_scaled","Accuracy"] <- nrow(slt2_predictions[slt2_predictions$origVsH2OScaled == "BOTH_RIGHT" | slt2_predictions$origVsH2OScaled == "OnlyB",])/
nrow(slt2_predictions)
metrics_table["rfh2o_slt2_perf_norm","Accuracy"] <- nrow(slt2_predictions[slt2_predictions$origVsH2ONorm == "BOTH_RIGHT" | slt2_predictions$origVsH2ONorm == "OnlyB",])/
nrow(slt2_predictions)
metrics_table["origRF_lu_perf","Accuracy"] <- nrow(lu_predictions[lu_predictions$origVsH2O == "BOTH_RIGHT" | lu_predictions$origVsH2O == "OnlyA",] )/
nrow(lu_predictions)
metrics_table["origRF_scaled_lu_perf","Accuracy"] <- nrow(lu_predictions[lu_predictions$origVsOrigScaled == "BOTH_RIGHT" | lu_predictions$origVsOrigScaled == "OnlyB",] )/
nrow(lu_predictions)
metrics_table["origRF_norm_lu_perf","Accuracy"] <- nrow(lu_predictions[lu_predictions$origVsOrigNorm == "BOTH_RIGHT" | lu_predictions$origVsOrigNorm == "OnlyB",] )/
nrow(lu_predictions)
metrics_table["rfh2o_lu_perf","Accuracy"] <- nrow(lu_predictions[lu_predictions$origVsH2O == "BOTH_RIGHT" | lu_predictions$origVsH2O == "OnlyB",] )/
nrow(lu_predictions)
metrics_table["rfh2o_lu_perf_scaled","Accuracy"] <- nrow(lu_predictions[lu_predictions$origVsH2OScaled == "BOTH_RIGHT" | lu_predictions$origVsH2OScaled == "OnlyB",])/
nrow(lu_predictions)
metrics_table["rfh2o_lu_perf_norm","Accuracy"] <- nrow(lu_predictions[lu_predictions$origVsH2ONorm == "BOTH_RIGHT" | lu_predictions$origVsH2ONorm == "OnlyB",] )/
nrow(lu_predictions)
metrics_table
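Each Accuracy cell above is just the fraction of rows where the given model was right in its comparison column, so the same numbers could be computed more compactly with mean(). A sketch, assuming the comparison columns in slt2_predictions / lu_predictions and using a hypothetical helper name:

```r
# Sketch: accuracy of model "B" in a comparison column is the fraction of
# rows where B was right (BOTH_RIGHT or OnlyB); swap OnlyB for OnlyA
# to get model "A"'s accuracy instead.
acc_of_B <- function(preds, col) {
  mean(preds[[col]] %in% c("BOTH_RIGHT", "OnlyB"))
}

# e.g. acc_of_B(slt2_predictions, "origVsH2O")  # rfh2o accuracy on SLT2
```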
The Legend for the graphs: * Blue = OrigRF * Magenta = OrigRF but scaled * Black = OrigRF but normalized * Red = H2O RF model * Brown = H2O RF model with scaled data * Green = H2O RF model with normalized data
# SLT2 Accuracy Plot
plot(origRF_slt2_performance$acc, col = "blue", lwd = 5, main = "SLT2 Accuracy")
lines(as.double(unlist(origRF_scaled_slt2_performance$acc@x.values)), as.double(unlist(origRF_scaled_slt2_performance$acc@y.values)), col = "magenta", lwd = 4)
lines(as.double(unlist(origRF_norm_slt2_performance$acc@x.values)), as.double(unlist(origRF_norm_slt2_performance$acc@y.values)), col = "black", lwd = 1)
lines(h2o.accuracy(rfh2o_slt2_performance), type = "l", col = "red", lwd = 3)
lines(h2o.accuracy(rfh2o_slt2_performance_scaled), type = "l", col = "brown", lwd = 3)
lines(h2o.accuracy(rfh2o_slt2_performance_norm), type = "l", col = "green", lwd = 3)
# LU Accuracy Plot
plot(origRF_lu_performance$acc, col = "blue", lwd = 5, main = "LU Accuracy")
lines(as.double(unlist(origRF_scaled_lu_performance$acc@x.values)), as.double(unlist(origRF_scaled_lu_performance$acc@y.values)), col = "magenta", lwd = 4)
lines(as.double(unlist(origRF_norm_lu_performance$acc@x.values)), as.double(unlist(origRF_norm_lu_performance$acc@y.values)), col = "black", lwd = 1)
lines(h2o.accuracy(rfh2o_lu_performance), type = "l", col = "red", lwd = 3)
lines(h2o.accuracy(rfh2o_lu_performance_scaled), type = "l", col = "brown", lwd = 3)
lines(h2o.accuracy(rfh2o_lu_performance_norm), type = "l", col = "green", lwd = 3)
The Legend for the graphs: * Blue = OrigRF * Magenta = OrigRF but scaled * Black = OrigRF but normalized * Red = H2O RF model * Brown = H2O RF model with scaled data * Green = H2O RF model with normalized data
metrics_table["origRF_slt2_perf","AUCPR"] <- origRF_slt2_performance$pr$auc.integral
metrics_table["origRF_scaled_slt2_perf","AUCPR"] <- origRF_scaled_slt2_performance$pr$auc.integral
metrics_table["origRF_norm_slt2_perf","AUCPR"] <- origRF_norm_slt2_performance$pr$auc.integral
metrics_table["rfh2o_slt2_perf", "AUCPR"] <- h2o.aucpr(rfh2o_slt2_performance)
metrics_table["rfh2o_slt2_perf_scaled", "AUCPR"] <- h2o.aucpr(rfh2o_slt2_performance_scaled)
metrics_table["rfh2o_slt2_perf_norm", "AUCPR"] <- h2o.aucpr(rfh2o_slt2_performance_norm)
metrics_table["origRF_lu_perf", "AUCPR"] <- origRF_lu_performance$pr$auc.integral
metrics_table["origRF_scaled_lu_perf", "AUCPR"] <- origRF_scaled_lu_performance$pr$auc.integral
metrics_table["origRF_norm_lu_perf", "AUCPR"] <- origRF_norm_lu_performance$pr$auc.integral
metrics_table["rfh2o_lu_perf", "AUCPR"] <- h2o.aucpr(rfh2o_lu_performance)
metrics_table["rfh2o_lu_perf_scaled", "AUCPR"] <- h2o.aucpr(rfh2o_lu_performance_scaled)
metrics_table["rfh2o_lu_perf_norm", "AUCPR"] <- h2o.aucpr(rfh2o_lu_performance_norm)
metrics_table
plot(origRF_slt2_performance$PR, col = "blue", lwd = 5, main = "SLT2 PR Curve" )
lines(as.double(unlist(origRF_scaled_slt2_performance$PR@x.values)), as.double(unlist(origRF_scaled_slt2_performance$PR@y.values)), col = "magenta", lwd = 4 )
lines(as.double(unlist(origRF_norm_slt2_performance$PR@x.values)), as.double(unlist(origRF_norm_slt2_performance$PR@y.values)), col = "black", lwd = 2 )
lines(x = h2o.recall(rfh2o_slt2_performance)[,"tpr"],
y = h2o.precision(rfh2o_slt2_performance)[,"precision"],
col = "red", type = "l", lwd = 3
)
lines(x = h2o.recall(rfh2o_slt2_performance_scaled)[,"tpr"],
y = h2o.precision(rfh2o_slt2_performance_scaled)[,"precision"],
col = "brown", type = "l", lwd = 3
)
lines(x = h2o.recall(rfh2o_slt2_performance_norm)[,"tpr"],
y = h2o.precision(rfh2o_slt2_performance_norm)[,"precision"],
col = "green", type = "l", lwd = 3
)
plot(origRF_lu_performance$PR, col = "blue", lwd = 5, main = "LU PR Curve" )
lines(as.double(unlist(origRF_scaled_lu_performance$PR@x.values)), as.double(unlist(origRF_scaled_lu_performance$PR@y.values)), col = "magenta", lwd = 4 )
lines(as.double(unlist(origRF_norm_lu_performance$PR@x.values)), as.double(unlist(origRF_norm_lu_performance$PR@y.values)), col = "black", lwd = 2 )
lines(x = h2o.recall(rfh2o_lu_performance)[,"tpr"],
y = h2o.precision(rfh2o_lu_performance)[,"precision"],
col = "red", type = "l", lwd = 3
)
lines(x = h2o.recall(rfh2o_lu_performance_scaled)[,"tpr"],
y = h2o.precision(rfh2o_lu_performance_scaled)[,"precision"],
col = "brown", type = "l", lwd = 3
)
lines(x = h2o.recall(rfh2o_lu_performance_norm)[,"tpr"],
y = h2o.precision(rfh2o_lu_performance_norm)[,"precision"],
col = "green", type = "l", lwd = 3
)
The Legend for the graphs: * Blue = OrigRF * Magenta = OrigRF but scaled * Black = OrigRF but normalized * Red = H2O RF model * Brown = H2O RF model with scaled data * Green = H2O RF model with normalized data
plot(origRF_slt2_performance$SS, col = "blue", lwd = 5, main = "SLT2 Sensitivity vs Specificity")
lines(as.double(unlist(origRF_scaled_slt2_performance$SS@x.values)), as.double(unlist(origRF_scaled_slt2_performance$SS@y.values)), col = "magenta", lwd = 4 )
lines(as.double(unlist(origRF_norm_slt2_performance$SS@x.values)), as.double(unlist(origRF_norm_slt2_performance$SS@y.values)), col = "black", lwd = 2 )
lines(x = h2o.specificity(rfh2o_slt2_performance)[,"tnr"],
y = h2o.sensitivity(rfh2o_slt2_performance)[,"tpr"],
col = "red", type = "l", lwd = 3
)
lines(x = h2o.specificity(rfh2o_slt2_performance_scaled)[,"tnr"],
y = h2o.sensitivity(rfh2o_slt2_performance_scaled)[,"tpr"],
col = "brown", type = "l", lwd = 3
)
lines(x = h2o.specificity(rfh2o_slt2_performance_norm)[,"tnr"],
y = h2o.sensitivity(rfh2o_slt2_performance_norm)[,"tpr"],
col = "green", type = "l", lwd = 3
)
plot(origRF_lu_performance$SS, col = "blue", lwd = 5, main = "LU Sensitivity vs Specificity")
lines(as.double(unlist(origRF_scaled_lu_performance$SS@x.values)), as.double(unlist(origRF_scaled_lu_performance$SS@y.values)), col = "magenta", lwd = 4 )
lines(as.double(unlist(origRF_norm_lu_performance$SS@x.values)), as.double(unlist(origRF_norm_lu_performance$SS@y.values)), col = "black", lwd = 2 )
lines(x = h2o.specificity(rfh2o_lu_performance)[,"tnr"],
y = h2o.sensitivity(rfh2o_lu_performance)[,"tpr"],
col = "red", type = "l", lwd = 3
)
lines(x = h2o.specificity(rfh2o_lu_performance_scaled)[,"tnr"],
y = h2o.sensitivity(rfh2o_lu_performance_scaled)[,"tpr"],
col = "brown", type = "l", lwd = 3
)
lines(x = h2o.specificity(rfh2o_lu_performance_norm)[,"tnr"],
y = h2o.sensitivity(rfh2o_lu_performance_norm)[,"tpr"],
col = "green", type = "l", lwd = 3
)
Some additional information that’s easily obtainable from the H2O library.
h2o.confusionMatrix(rfh2o_slt2_performance)
Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.644166666567326:
h2o.confusionMatrix(rfh2o_slt2_performance_scaled)
Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.232701931446791:
h2o.confusionMatrix(rfh2o_slt2_performance_norm)
Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.266167617663741:
plot(h2o.F1(rfh2o_slt2_performance), col = "blue", main = "F1 Performance")
lines(h2o.F1(rfh2o_slt2_performance_scaled), col = "red")
lines(h2o.F1(rfh2o_slt2_performance_norm), col = "orange")
plot(rfh2o_slt2_performance, # REMINDER: TPR = Sensitivity, FPR = (1 - specificity)
type = "roc",
col = "red",
cex = 0.2,
pch = 10
)
h2o.confusionMatrix(rfh2o_lu_performance)
Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.496151700350456:
plot(h2o.F1(rfh2o_lu_performance))
plot(rfh2o_lu_performance,
type = "roc",
col = "red",
cex = 0.2,
pch = 10
)
Random Forest Explainer (RFE) is a global interpretability tool that ultimately allows you to better understand your RF model and create visually attractive and useful PDP plots. The methodology behind RFE is simple: identify the most important features and their interactions, then graph these interactions in what is essentially a 2D PDP plot. Unfortunately, when trying to use RFE within this notebook, the outputs generated errors and the plots simply did not show. To get around this problem, we simply add the graphs as generated by the RFE library. The code that generated these graphs can be found here, and the HTML output generated by the explain_forest() function can also be found at “./Your_forest_explained.html”. Keep in mind that the HTML file is automatically generated by the library and does not include all the plots we generated (it seems to show only what it considers the main plots). Since you still have to generate the plots yourself, the HTML file is a nice bonus, but not a necessity.
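For reference, the RFE plots shown below were generated along these lines. This is a sketch using the randomForestExplainer API; origRF is the model built earlier in this notebook:

```r
library(randomForestExplainer)

# Distribution of minimal depth across trees, and its mean per feature
min_depth_frame <- min_depth_distribution(origRF)
plot_min_depth_distribution(min_depth_frame)

# Multi-way importance (times_a_root vs mean_min_depth vs no_of_nodes)
importance_frame <- measure_importance(origRF)
plot_multi_way_importance(importance_frame,
                          x_measure = "mean_min_depth",
                          y_measure = "times_a_root",
                          size_measure = "no_of_nodes")

# Self-contained HTML report ("Your_forest_explained.html" mentioned above)
explain_forest(origRF, interactions = TRUE)
```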
Distribution of minimal depth and its mean
rfe_importance_frame <- read.csv("./Results/RFE_importance_frame_origRF.csv", sep = ",", header = TRUE)
rfe_importance_frame
Times a root vs Mean min depth vs Number of Nodes
## 15. GGpairs Comparisons
The following graphs are helpful in identifying feature importance
ggpairs full
ggpairs main features
ggpairs importance rankings
Top 30 Feature Interactions
From this graph, we gather that the top 5 feature interactions are:
1. DownDistance:SS
2. Distance:SS
3. DownDistance:Pos10wrtsRNAStart
4. DistTerm:SS
5. Distance:Pos10wrtsRNAStart

Even though Distance and DownDistance were considered the most important features, their interactions did not appear to be as significant as the others
* The higher(closer to 0) the SS, the higher the confidence in the prediction
* Distances greater than -850 => higher chance of having a bona fide sRNA * SS higher than -10 => lower probability of having a bona fide sRNA
* No clearly discernible takeaway
* SS of 0 seems to virtually guarantee that we don’t have a bona fide sRNA
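The 2D interaction plots these takeaways are drawn from can be reproduced with randomForestExplainer’s plot_predict_interaction. A sketch; origRF is the model from earlier sections, and trainData is a hypothetical name for the data frame it was fitted on:

```r
library(randomForestExplainer)

# Prediction surface (essentially a 2D PDP) for the top-ranked interactions.
# trainData is assumed to be the frame origRF was trained on.
plot_predict_interaction(origRF, trainData, "DownDistance", "SS")
plot_predict_interaction(origRF, trainData, "Distance", "SS")
```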